Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Topic subscription scalability fix #1217

Merged
merged 5 commits into from
Dec 21, 2017

Conversation

mikepurvis
Copy link
Member

Switching to using epoll system calls to improve performance of the topic polling code by a factor of 2.

See previous discussions and rationale in #1162 and clearpathrobotics#12. Closes #1078.

mikepurvis pushed a commit that referenced this pull request Nov 8, 2017
Switching to using epoll system calls to improve performance of the topic polling code by a factor of 2.
@mikepurvis
Copy link
Member Author

Reproducing some comments from the previous thread. In particular, regarding the addDelMultiThread unit test failure (from @guillaumeautran):

After further investigations, it appears that the original code (prior to my change) was behaving differently when multiple handlers are registered for the same socket. Based on the code here: poll_set.cpp#L76 multiple handlers are added to the internal data structure but when registering for events on the socket, poll_set.cpp#L125, only the first handler is marked to receive the event.
This actually IMHO is a bug of the original implementation because the event handler call that is supposed to receive one type of event may not be able to handle all events correctly (write event on a read-only handler, etc). In addition, deleting the event from the handler will remove the flag from the first handler found in the list, poll_set.cpp#L144, which might not be correct either as that particular handler may still be looking to receive that particular event.

IMHO, the proper behaviour with multiple event handlers should really be such that the handler registering for a particular event should be the one receiving that event. For example, a single handler could be used for socket reads while a separate handler could be used for socket writes.

Because the code still contains all three versions (Windows, poll, epoll), I suggest we disable this particular unit test and leave it as that.

Regarding ABI breakage and whether this should go into Lunar or wait for Melodic (from me):

I suppose we cannot be completely assured [that no one is using them], but the two affected headers (ros/poll_set.h and ros/io.h) are pretty clearly deep implementation details, and there's not a puff about either of them in the wiki API documentation, nor are the classes involved meant to be extended— PollSet has no virtual methods, and there's no other functions which accept a PollSet instance.

I think the main risk with PollSet is that there could be a member of that type in an API class, which would cause the size of that class to change. However, on a quick grep through the repo, it looks like all instances (TransportTCP, TransportUDP, PollManager), are maintained as pointers or references, so those are safe from size change issues.

It would be great to get some closure on both these points so we can plan for our coming releases.

if (revents == 0)
{
continue;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this check / continue stay above the declaration of func, transport ad events as it was before?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can but it really has very little impact since we are talking about variable declaration. I do prefer to have variable declaration at the top of the scope block for clarity purposes. But that's just me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can stay as-is in this case. I usually aim to keep patches minimal. While there is for sure plenty of code which would benefit cleanup any change makes the review more difficult, has a chance of regressions, and makes backporting patches more effort.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense. Thanks.

@dirk-thomas
Copy link
Member

Because the code still contains all three versions (Windows, poll, epoll), I suggest we disable this particular unit test and leave it as that.

I am usually not in favor of something like this. I would prefer to address the problem of the test and/or the implementation first and then apply a significant change like this. Simply because disabling tests without a replacement when introducing changes has a very bad taste.

Regarding ABI breakage and whether this should go into Lunar or wait for Melodic

I am leaning towards the conservative choice - especially after the "nightmare" of the latest round of backports into Kinetic (which took several fixes, re-releases and a significant amount of time). But I can see how the benefit would be good to have in the LTS release.

I will give others a chance to state their opinion here. If the consensus is to just ditch the problematic test and conclude that the benefit of the feature outweighs the regression risk as well as the ABI breakage I am happy to merge the patch (and after being released into Lunar for a bit consider it for backporting into Kinetic). @ros/ros_team Please add your comments on this.

@mikepurvis mikepurvis changed the title Topic subscription scalability fix (#1078) (#12) Topic subscription scalability fix Nov 8, 2017
@dirk-thomas
Copy link
Member

@ros/ros_team Waiting for feedback.

@tfoote
Copy link
Member

tfoote commented Nov 15, 2017

This does seem like a big change and should be quite conservative about backporting. Possibly waiting a few cycles for validation in real use cases on Lunar. But if shown to work well it also has large enough benefits to validate a backport effort.

Switching to using epoll system calls to improve performance of the topic polling code by a factor of 2.
@dirk-thomas
Copy link
Member

@tfoote The question was not if this should be backported to Kinetic. The question is if it should only go into Melodic or if it should be released into Lunar even though it it is a big change and changes some headers (which are argued for not likely being used by user land code).

@tfoote
Copy link
Member

tfoote commented Nov 15, 2017

Implicit in my response was to go ahead and release it on Lunar.

@dirk-thomas
Copy link
Member

Since nobody else commented on this ticket @mikepurvis can you please update this so that the PR job becomes green (e.g. by disabling the test in question as suggested above). While I don't think it is a good choice I don't want to stand in the way to get this merged.

@mikepurvis
Copy link
Member Author

Test disabled, with appropriate comment which I believe captures the situation.

That said, we should probably decide now whether there would be an intention to resolve the discrepancy in the future and reenable the test in which case file a ticket for that work, or plan to leave it as-is, in which case the test should probably be completely removed rather than just disabled (disabled implies a temporary action, per gtest docs).

@mikepurvis
Copy link
Member Author

Looks like the Jenkins xUnit plugin is still marking an unstable build on account of the disabled test, so it will need to actually be removed in order to get a green light.

@dirk-thomas
Copy link
Member

@ros-pull-request-builder retest this please

@mikepurvis
Copy link
Member Author

Recognizing multiple contributors here, please merge without squashing.

@dirk-thomas
Copy link
Member

I would rather squash the changes if there is no strong reason to keep the various authors since it makes potential backporting / rolling back much easier.

@mikepurvis
Copy link
Member Author

Okay, squashing and merging now.

@mikepurvis mikepurvis merged commit 9c0db37 into ros:lunar-devel Dec 21, 2017
dirk-thomas pushed a commit that referenced this pull request Feb 9, 2018
Switching to using epoll system calls to improve performance of the topic polling code by a factor of 2. This required disabling the addDelMultiThread test.
@mikepurvis mikepurvis deleted the scalability-fix branch April 24, 2018 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants